Explore and Summarize Red Wine Data by Luis Cruz

This project uses a wine dataset related to the red variant of the Portuguese “Vinho Verde” wine. The data is available here. A description of this dataset can be found here.

According to this description there are no Null values. A description of the attributes is also given:

1 - fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g / dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g / cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol (% by volume): the percent alcohol content of the wine

12 - quality (score between 0 and 10) – Output variable (based on sensory data)

Dataset Overview

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
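The overview above appears to be the output of `dim()`, `names()`, `str()` and `summary()`. As a minimal sketch, the same calls can be tried on a small data frame; `wine_sample` is a hypothetical name, and it holds only the first ten rows of three columns as printed in the `str()` output, not the full dataset.

```r
# First ten rows of three columns, copied from the str() output above.
# `wine_sample` is a hypothetical name used only for this illustration.
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_sample)      # number of rows and columns
names(wine_sample)    # column names
str(wine_sample)      # type and first values of each column
summary(wine_sample)  # quartiles, mean, minimum and maximum per column

# Check the two claims made about the data: no missing values, all numeric.
has_missing <- anyNA(wine_sample)
all_numeric <- all(sapply(wine_sample, is.numeric))
```

On the full data frame the same two checks would confirm (or refute) the statement in the dataset description that there are no missing values.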

This overview shows that all the attributes are numerical. We can also see that the wines in this study have relatively low alcohol content, since 75% of the wines have between 8.40% and 11.10% alcohol by volume. This is a common feature of this kind of wine, “Vinho Verde”. Properties such as residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide seem to have outliers, since the maximum is much higher than the 75th percentile.
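The outlier remark can be made a little more concrete with Tukey's rule, which flags values above Q3 + 1.5 × IQR. The sketch below is not part of the original analysis; it only reuses the quartiles and maxima printed in the summary, and the name `stats` is an illustrative choice.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides",
               "free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.070, 7, 22),
  q3  = c(2.6, 0.090, 21, 62),
  max = c(15.5, 0.611, 72, 289)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are potential outliers.
stats$upper_fence    <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_is_outlier <- stats$max > stats$upper_fence

print(stats)
```

The upper fences come out at 3.65, 0.12, 42 and 122, so the maximum of each of the four variables lies well beyond its fence.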

Univariate Plots Section

Quality

## Warning: Removed 2 rows containing missing values (geom_bar).

Alcohol

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

There are some outliers below 9% and above 14% alcohol by volume. Most wines are between 9% and 12%.

pH

## Warning: Removed 2 rows containing missing values (geom_bar).

Fixed acidity

## Warning: Removed 2 rows containing missing values (geom_bar).

Sulphates

## Warning: Removed 2 rows containing missing values (geom_bar).

Most wines have sulphates between 0.4 and 0.9 \(g/dm^3\).

Univariate Analysis

What is the structure of your dataset?

There are 1599 different wines with 13 variables, including column X, which is just an index variable.

The 13 columns in this dataset are the following:

  • X
  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality

They are all numeric.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is quality. It would also be interesting to have a feature for price. Unfortunately, it is not easy to identify which wine each row refers to, so this information is not available.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect the chemical composition of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol) to have an impact on its quality.

I am also curious to see whether a good wine needs to have a high percentage of alcohol by volume, or whether quality is correlated with the amount of sugar. Since sulphates are an additive, I expect them to be lower in high quality wines.

Did you create any new variables from existing variables in the dataset?

So far, I have not found it useful to derive new features from this dataset.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

The features most strongly correlated with quality are alcohol (0.476) and volatile.acidity (-0.391); no other feature exceeds 0.3 in absolute value. The positive correlation with alcohol is really interesting.

Although I suggested that sulphates could be negatively correlated with quality, this is not evident in this analysis: the correlation is in fact weakly positive (0.251).
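To read the quality column of the matrix more easily, the correlations can be ranked by absolute value. This is a small sketch that reuses the numbers printed above rather than recomputing them from the data; `cor_quality` and `ranked` are illustrative names.

```r
# Correlations with quality, copied from the quality column of the matrix above.
cor_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Order by absolute value so the strongest relationships come first.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

Alcohol and volatile.acidity lead the ranking, followed by sulphates and citric.acid; residual.sugar comes last.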

The following sections provide scatter plots of quality with other variables.

Quality vs. Alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  df$alcohol and df$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
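The t statistic and the confidence interval in this output can be reproduced by hand from the correlation and the sample size alone. The following sketch uses the standard formulas for a Pearson correlation test (t with n - 2 degrees of freedom, and a Fisher z-transformation for the interval); the variable names are illustrative.

```r
r <- 0.4761663   # sample correlation from the output above
n <- 1599        # number of wines in the dataset

# t statistic with n - 2 = 1597 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation:
# atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results agree with the reported t = 21.639 and the interval from 0.4374 to 0.5132. With roughly 1,600 observations even a moderate correlation yields a very small p-value, so the size of the correlation is more informative here than its significance.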

## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

Quality vs. Volatile Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  df$volatile.acidity and df$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Other interesting relationships

Density vs. Fixed Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  df$density and df$fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality in this dataset shows a positive correlation with alcohol (0.476) and a negative correlation with volatile acidity (-0.391).

The positive correlation with alcohol is interesting. Perhaps more knowledge about this kind of wine is needed to explain this relationship.

The negative correlation with volatile acidity is in line with the fact that high levels of volatile acidity give the wine a vinegar taste.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are in fact some other relationships, but none seems to add interesting information to our problem.

Density is negatively correlated with alcohol (-0.496). That was expected, since alcohol has a lower density than water (yes, wine has water in its composition :-)).

Citric acid is correlated with fixed acidity (0.672), which is expected since citric acid is a fixed acid.

What was the strongest relationship you found?

The strongest relationship was between pH and fixed acidity (\(-0.683\)). pH measures the acidity of the wine. Since the acidity comes from both fixed acidity and volatile acidity, the sum of the two is expected to be highly correlated with pH. Volatile acidity has little impact on pH because its values are much lower than those of fixed acidity, as shown above in the summary of the dataset (the median is \(7.90\) for fixed acidity and \(0.52\) for volatile acidity).

This is interesting because, although volatile acidity is negatively correlated with quality, it is not recognizable from the overall acidity of the wine.
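The idea that the sum of fixed and volatile acidity should track pH could be checked along the following lines. This is only a sketch: `total.acidity` is a hypothetical derived column that does not exist in the dataset, and the data frame holds just the first ten rows printed in the dataset overview, so the resulting numbers are not representative of all 1599 wines.

```r
# First ten rows, copied from the str() output in the dataset overview.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Hypothetical derived column: the sum of the two acidity measures.
wine_sample$total.acidity <-
  wine_sample$fixed.acidity + wine_sample$volatile.acidity

# Correlation of each acidity measure with pH on this small sample.
cor_with_pH <- sapply(
  wine_sample[c("fixed.acidity", "volatile.acidity", "total.acidity")],
  function(x) cor(x, wine_sample$pH)
)
print(round(cor_with_pH, 3))
```

Running the same comparison on the full data frame would show whether the summed column adds anything over fixed acidity alone.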

Multivariate Plots Section

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The variables with the highest absolute correlation with quality are volatile.acidity and alcohol. The following plots investigate this relationship.

This last plot shows that for quality levels 5 and 6 the points are spread over a large region. Plots for the other quality levels show that there is a distinct pattern.

Perhaps combining the quality into 3 categories might help:
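One way such a grouping could be built is with `cut()`. The cut points and labels below are assumptions made for illustration, since the ones actually used are not stated; the `quality` vector holds the first ten values from the dataset overview.

```r
# First ten quality scores, copied from the str() output in the overview.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed grouping: 3-4 = low, 5-6 = medium, 7-8 = high.
# With the default right = TRUE the intervals are (2,4], (4,6] and (6,8].
quality_group <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

table(quality_group)
```

On these ten wines the grouping gives 0 low, 8 medium and 2 high, which already hints at why this may not help much: scores 5 and 6 dominate, so most wines land in the middle category.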

It did not help. Let’s go back to the quality variable analysis and try to add a fourth variable.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection